Abstract

Grammar based compression, where one replaces a long string by a small context-free grammar that 
generates the string, is a simple and powerful paradigm that captures many of the popular compression 
schemes, including the Lempel-Ziv family, Run-Length Encoding, Byte-Pair Encoding, Sequitur, and 
Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access 
to any character or substring of S without decompressing S. 

Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two
representations of $\mathcal{S}$ achieving either $O(n)$ construction time and space and $O(\log N \log\log N)$ random
access time, or $O(n \cdot \alpha_k(n))$ construction time and space and $O(\log N)$ random access time. Here, $\alpha_k(n)$
is the inverse of the $k$-th row of Ackermann's function. Our representations extend to efficiently support
decompression of any substring in $S$. Namely, we can decompress any substring of length $m$ in the same
complexity as a random access query and additional $O(m)$ time. Combining this with fast algorithms for
standard uncompressed approximate string matching leads to several efficient algorithms for approximate
string matching within grammar compressed strings without decompression. For instance, we can find
all approximate occurrences of a pattern $P$ with at most $k$ errors in time
$O(n(\min\{|P|k, k^4 + |P|\} + \log N) + \mathrm{occ})$, where occ is the number of occurrences of $P$ in $S$.

All of the above bounds significantly improve the currently best known results. To achieve these 
bounds, we introduce several new techniques and data structures of independent interest, including a 
predecessor data structure, a weighted ancestor data structure, and a compact representation of heavy- 
paths in grammars. 

1 Introduction 

Modern text databases, e.g. for biological and World Wide Web data, are huge. A typical way of storing 
them efficiently is to keep them in compressed form. The challenge arises when we want to access a small part 
of the data or search within it, e.g., if we want to retrieve a particular DNA sequence from a large collection 
of DNA sequences or search for approximate matches of a newly discovered DNA sequence. The naive way 
of achieving this would be to first decompress the entire data and then search within the uncompressed data. 
However, such decompression can be highly wasteful in terms of both time and space. Instead we want to 
support this functionality directly on the compressed data. 

We focus on the following primitives. Let $S$ be a string of length $N$ given in a compressed representation
$\mathcal{S}$ of size $n$. The random access problem is to compactly represent $\mathcal{S}$ while supporting fast random access
queries, that is, given an index $i$, $1 \le i \le N$, report $S[i]$. More generally, we want to support substring
decompression, that is, given a pair of indices $i$ and $j$, $1 \le i \le j \le N$, report the substring $S[i] \cdots S[j]$.
The goal is to use little space for the representation of $\mathcal{S}$ while supporting fast random access and substring
decompression. Once we obtain an efficient substring decompression method, it can then serve as a basis for
a compressed version of classical pattern matching. Namely, given an (uncompressed) pattern string $P$ and
$\mathcal{S}$, the compressed pattern matching problem is to find all occurrences of $P$ within $S$ without decompressing
$\mathcal{S}$. The goal here is to search more efficiently than to naively decompress $\mathcal{S}$ into $S$ and then search for $P$
in $S$. An important variant of the pattern matching problem is when we allow approximate matching (i.e.,
when $P$ is allowed to appear in $S$ with some errors).
Figure 1: (a) A context-free grammar generating the string abaababa. (b) The corresponding parse tree. (c)
The acyclic graph defined by the grammar.



We consider these problems in the context of grammar-based compression, where one replaces a long 
string by a small context-free grammar (CFG) that generates this string (see Fig. 1(a)). Such grammars
that generate only one unique string capture many of the popular compression schemes including the Lempel- 
Ziv family [SJJlMlliS] , Sequitur gT], Run-Length Encoding, Byte-Pair Encoding [USS], Re-Pair and 
many more [B - 8 30 31 53 . All of these are or can be transformed into equivalent grammar-based compression 
schemes with no or little expansion [15, 43]. In general, the size of the grammar, defined as the total number
of symbols in all derivation rules, can be exponentially smaller than the string it generates.

Grammar based compression dates back to at least the 1970s [48, 49, 54, 55]. Since then, it has become
popular in diverse areas outside of data compression including analysis of DNA [32, 40], music [41], natural
language [20], and complexity analysis of sequences [15, 43, 54]. From an algorithmic perspective, the properties
of compressed data have been exploited to accelerate the solutions to classical problems on strings. In particular,
compression was employed to accelerate both exact pattern matching [4, 29, 35, 37, 47] and approximate pattern
matching [3[irrj^[rgi[23{2i[36l[38] . 

Our Results 

In this paper we present a new representation of CFGs that supports efficient random access and substring
decompression. Let $\alpha_k(n)$ be the inverse of the $k$-th row of Ackermann's function¹. We show the following.

Theorem 1 For a CFG $\mathcal{S}$ of size $n$ representing a string of length $N$ we can support random access

(i) in time $O(\log N \log\log N)$ after $O(n)$ preprocessing time and space, or

(ii) in time $O(\log N)$ after $O(n \cdot \alpha_k(n))$ preprocessing time and space for any fixed $k$.

Secondly, we consider the substring decompression problem. Clearly it is possible to decompress a substring
of length $m$ by doing $m$ random access queries. We show that this can actually be done in the same
complexity as a single random access query plus $O(m)$ time, as stated by the following theorem.

Theorem 2 For a CFG $\mathcal{S}$ of size $n$ representing a string of length $N$ we can decompress a substring of
length $m$

(i) in time $O(m + \log N \log\log N)$ after $O(n)$ preprocessing time and space, or

(ii) in time $O(m + \log N)$ after $O(n \cdot \alpha_k(n))$ preprocessing time and space for any fixed $k$.

1 The inverse Ackermann function $\alpha_k(n)$ can be defined by $\alpha_k(n) = 1 + \alpha_k(\alpha_{k-1}(n))$ so that $\alpha_1(n) = n/2$, $\alpha_2(n) = \log n$,
$\alpha_3(n) = \log^* n$, $\alpha_4(n) = \log^{**} n$ and so on. Here, $\log^{**} n$ is the number of times the $\log^*$ function is applied to $n$ to produce a
constant.
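The recursive definition of the inverse Ackermann rows can be evaluated directly for small inputs. The following is a minimal sketch, not part of the original presentation: the function name `alpha`, the rounding of $n/2$ upward, and the base case $\alpha_k(n) = 0$ for $n \le 1$ are assumptions made only so that the recursion terminates on integers.

```python
def alpha(k: int, n: int) -> int:
    """k-th row of the inverse Ackermann hierarchy, following
    alpha_k(n) = 1 + alpha_k(alpha_{k-1}(n)) with alpha_1(n) = n / 2."""
    if k == 1:
        return (n + 1) // 2      # assumed rounding: ceil(n / 2)
    if n <= 1:
        return 0                 # assumed base case for the recursion
    return 1 + alpha(k, alpha(k - 1, n))


# Row 2 behaves like log n, row 3 like log* n, row 4 like log** n.
rows = {k: alpha(k, 65536) for k in (1, 2, 3, 4)}
```

Under these assumptions the values for $n = 65536$ drop from 32768 in row 1 to 16, 4 and 3 in rows 2, 3 and 4, which illustrates how slowly the higher rows grow.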
Finally, we show how to combine Theorem 2 with any black-box (uncompressed) approximate string matching
algorithm to solve the corresponding compressed approximate string matching problem over grammar-compressed
strings. We obtain the following connection between classical (uncompressed) and grammar
compressed approximate string matching. Let $t(m)$ and $s(m)$ be the time and space bounds of some
(uncompressed) approximate string matching algorithm on strings of lengths $O(m)$, and let occ be the number
of occurrences of $P$ in $S$.

Theorem 3 Given a CFG $\mathcal{S}$ of size $n$ representing a string of length $N$ and a string $P$ of length $m$ we can
find all approximate occurrences of $P$ in $S$

(i) in time $O(n(m + t(m) + \log N \log\log N) + \mathrm{occ})$ and space $O(n + m + s(m) + \mathrm{occ})$, or

(ii) in time $O(n(m + t(m) + \log N) + \mathrm{occ})$ and space $O(n \cdot \alpha_k(n) + m + s(m) + \mathrm{occ})$.

We next describe how our work relates to existing results. We assume without loss of generality that the 
grammars are in fact straight-line programs (SLPs) and so on the righthand side of each grammar rule there 
are either exactly two variables or one terminal symbol. We further make the assumption that log N bits fit 
in a machine word. This is a fair assumption as the input i to a random access query is also of log N bits. 

Related Work 

The random access problem. It is easy to derive the following two simple trade-offs. If we use $O(N)$
space we can access any character in constant time by storing $S$ explicitly in an array. Alternatively, if we
are willing to settle for $O(n)$ random access time then the space can be reduced to $O(n)$ as well. To achieve
this, we compute and store the sizes of strings derived by each grammar symbol in $\mathcal{S}$. This only requires
$O(n)$ space and allows us to simulate a top-down search expanding the grammar's derivation tree in constant
time per node. Consequently, a random access takes time $O(h)$, where $h$ is the height of the derivation tree
and can be $\Omega(n)$. Surprisingly, the only known improvement to these trivial bounds is a recent succinct
representation of grammars, due to Claude and Navarro [17]. They reduce the space from $O(n \log N)$ bits
to $O(n \log n) + n \log N$ bits at the cost of increasing the query time to $O(h \log n)$.
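The simple $O(h)$ trade-off described above can be sketched as follows. This is an illustrative sketch only: the dictionary encoding of the SLP, the variable names `X1` to `X6`, and the helper names `compute_sizes` and `access` are assumptions, and the example grammar is one possible SLP for the string abaababa of Fig. 1.

```python
from typing import Dict, List, Tuple, Union

# A rule is either a terminal character or a pair (left, right) of variables.
Rule = Union[str, Tuple[str, str]]


def compute_sizes(slp: Dict[str, Rule], order: List[str]) -> Dict[str, int]:
    """Length of the string derived by each variable.
    `order` lists the variables bottom-up (children before parents)."""
    size: Dict[str, int] = {}
    for v in order:
        rule = slp[v]
        size[v] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size


def access(slp: Dict[str, Rule], size: Dict[str, int], root: str, i: int) -> str:
    """Return S[i] (1-based) by a top-down walk; one step per tree level."""
    if not 1 <= i <= size[root]:
        raise IndexError(i)
    v = root
    while not isinstance(slp[v], str):
        left, right = slp[v]
        if i <= size[left]:
            v = left                 # the position falls in the left subtree
        else:
            i -= size[left]          # skip the left subtree
            v = right
    return slp[v]


# Hypothetical SLP for "abaababa" (a Fibonacci-like grammar).
slp = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
       "X5": ("X4", "X3"), "X6": ("X5", "X4")}
order = ["X1", "X2", "X3", "X4", "X5", "X6"]
size = compute_sizes(slp, order)
text = "".join(access(slp, size, "X6", i) for i in range(1, size["X6"] + 1))
```

Each query touches one node per level of the derivation tree, which is why the cost is governed by the height $h$ rather than by $\log N$.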

The substring decompression problem. Using the simple random access trade-off we get an $O(n)$
space solution that supports substring decompression in $O(hm)$ time. Gasieniec et al. [2UE5] showed how
to improve the decompression time to $O(h + m)$ while maintaining $O(n)$ space. Furthermore, the above
mentioned succinct representation of Claude and Navarro [17] supports substring decompression in time
$O((m + h) \log n)$.

The compressed pattern matching problem. Recall that in approximate pattern matching, we are
given two strings $P$ and $S$ and an error threshold $k$. The goal is then to find all ending positions of substrings
of $S$ that are similar to $P$ up to $k$ errors. There are various ways to define $k$ errors. Perhaps the most popular
one is the edit distance metric, where $k$ is a bound on the number of insertions, deletions, and substitutions
needed to convert one substring to the other.

In classical (uncompressed) approximate pattern matching, a simple dynamic programming solution of
Sellers [35] solves this problem (under edit distance) in $O(Nm)$ time and $O(m)$ space, where $N$ and $m$ are
the lengths of $S$ and $P$ respectively. Several improvements of this result are known, see e.g., the survey
by Navarro [39]. Two well known improvements for small values of $k$ are the $O(Nk)$ time algorithm of
Landau and Vishkin [35] and the $O(Nk^4/m + N)$ time algorithm of Cole and Hariharan [IS]. Both of
these can be implemented in $O(m)$ space. The use of compression led to many speedups using various
compression schemes [3, 9, 12-14, 19, 27-29, 36, 38]. Many of these speedups are only for checking whether
$P$ is similar to $S$ (rather than asking about all substrings of $S$). The most closely related to our work is
approximate pattern matching for LZ78 and LZW compressed strings [12, 28, 38], which can be solved in time
$O(n(\min\{mk, k^4 + m\}) + \mathrm{occ})$ [12]. Here, $n$ corresponds to the compressed length under the LZ compression.
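For concreteness, the classical $O(Nm)$ time and $O(m)$ space dynamic program for the uncompressed problem can be sketched as below. This is a standard textbook formulation of the column-by-column recurrence under edit distance, given here as an illustration; the function name `approx_ends` and the 1-based reporting of ending positions are choices made for this sketch.

```python
from typing import List


def approx_ends(pattern: str, text: str, k: int) -> List[int]:
    """1-based ending positions j such that some substring of `text`
    ending at j is within edit distance k of `pattern`."""
    m = len(pattern)
    col = list(range(m + 1))           # column for the empty text prefix: D[i] = i
    ends = []
    for j, c in enumerate(text, start=1):
        prev_diag = col[0]             # D[i-1][j-1] for i = 1
        col[0] = 0                     # an occurrence may start anywhere in the text
        for i in range(1, m + 1):
            cur = col[i]               # D[i][j-1], needed as the next diagonal
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitution
                         cur + 1,                             # extra text character
                         col[i - 1] + 1)                      # missing pattern character
            prev_diag = cur
        if col[m] <= k:
            ends.append(j)
    return ends


exact = approx_ends("aba", "abaababa", 0)
```

Only a single column of $m + 1$ values is kept at any time, which is what gives the $O(m)$ space bound quoted above.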
Theorem 3 gives us the first non-trivial algorithms for approximate pattern matching over any grammar
compressed string. For instance, if we plug in the Landau-Vishkin or Cole-Hariharan algorithms in
Theorem 3(ii) we obtain an algorithm of $O(n(\min\{mk, k^4 + m\} + \log N) + \mathrm{occ})$ time and $O(n \cdot \alpha_k(n) + m + \mathrm{occ})$
space. It is important to note that any algorithm (not only Landau-Vishkin or Cole-Hariharan) and any
similarity measure (not only edit distance) can be applied to Theorem 3. For example, under the Hamming
distance measure we can combine our algorithm with a fast algorithm for the (uncompressed) approximate
string matching problem for the Hamming distance measure [5].

Outline and Techniques 

Before diving into the details, we give an outline of the paper and of the new techniques and data structures 
that we introduce and believe to be of independent interest. 

Let $\mathcal{S}$ be an SLP of size $n$ representing a string of length $N$. We begin in Section 2 by defining a forest
$H$ of size $n$ that represents the heavy paths in the parse tree of $\mathcal{S}$. We then combine the forest $H$ with
an existing weighted ancestor data structure², leading to a solution to the random access problem in the
bounds of Theorem 1(i).

The main part of the paper focuses on reducing the random access time to $O(\log N)$. This is achieved by
designing a new weighted ancestor data structure. In Section 3 we describe the building block of this data
structure - the interval-biased search tree. An interval-biased search tree is a new and simple linear time
constructible predecessor data structure. The query complexity of predecessor($p$) on this data structure is
governed by the integer $x = |\mathrm{successor}(p) - \mathrm{predecessor}(p)|$: a query takes $O(\log(N/x))$ time, so wide gaps
are located quickly. This is designed in such a way that if we assign
one interval-biased search tree for every root-to-leaf path in the representation $H$ then a weighted ancestor
query on $H$ translates to $O(\log N)$ predecessor queries whose total time sums up to $O(\log N)$. However, this
is at the cost of $O(n^2)$ preprocessing time and space as there can be $O(n)$ root-to-leaf paths each assigned to
an $O(n)$-sized interval-biased search tree. The goal of Section 4 is then to reduce this quadratic preprocessing
to be only an inverse Ackermann factor away from linear.

To achieve this, we utilize the overlaps between root-to-leaf paths in $H$ as captured by another heavy-path
decomposition, this time of $H$ itself. Assigning one interval-biased search tree for every disjoint heavy
path in this decomposition requires only $O(n)$ total preprocessing. We are then left with the problem of
navigating between these paths during a random access query. We show that this translates to a weighted
ancestor query on a related tree $L$. The tree $L$ has a vertex for each path in the decomposition of $H$ and an
edge for each pair of adjacent paths. While the length of a root-to-leaf path in $H$ can be arbitrary, in $L$ it is always
bounded by $O(\log n)$. This fact is enough to reduce the preprocessing to $O(n \log n)$.

To further reduce the preprocessing, we partition $L$ into disjoint trees in the spirit of Alstrup, Husfeldt,
and Rauhe's decomposition [2] for solving the marked ancestor problem. One of these trees has $O(n/\log n)$
leaves and so we can apply the solution above for $O(n)$ preprocessing. The other trees all have $O(\log n)$
leaves and we handle them recursively. However, before we can recurse on these trees they are modified so
that each has $O(\log n)$ vertices (rather than leaves). This is done by another type of path decomposition (i.e.,
not a heavy-path decomposition) of $L$. By carefully choosing the desired sizes of the recursion subproblems
we get the bounds of Theorem 1(ii).

We extend both random access solutions to the substring decompression problem in Section 5. To do
this efficiently we augment our data structures with a linear number of pointers. These pointers allow us to
compute the roots of the subtrees of the parse tree that must be expanded to produce the desired substring
using only two random access computations. The subsequent expansion of the substring can be done in
linear time in the length of the substring, leading to the bounds of Theorem 2.

Finally, in Section 6 we combine our substring decompression result with a technique for compressed
approximate string matching on LZ78 and LZW compressed strings [12] to obtain an algorithm for grammar
compressed strings. The algorithm computes the approximate occurrences of the pattern in a single bottom-up
traversal of the grammar. At each step we use the substring decompression algorithm to decode a relevant
small portion of the string, thereby avoiding a full decompression.

2 A weighted ancestor query (v,p) asks for the lowest ancestor of v whose weighted distance from v is at least p. 
2 Fast Random Access in Linear Space

In the rest of the paper, we let $\mathcal{S}$ denote an SLP of size $n$ representing a string of length $N$, and let $T$
be the corresponding parse tree (see Fig. 1(b)). Below we present an $O(n)$ space representation of $\mathcal{S}$ that
supports random access in $O(\log N \log\log N)$ time, leading to Theorem 1(i). To achieve this we partition $\mathcal{S}$
into disjoint paths according to a heavy path decomposition [26] of the parse tree $T$ and combine this with
fast predecessor data structures to search in these heavy paths. This achieves the desired time bound but
uses $O(n^2)$ preprocessing time and space. We introduce a simple but compact representation of all these
predecessor data structures that reduces the space to $O(n)$.

2.1 Heavy Path Decompositions 

Similar to Harel and Tarjan [26], we define the heavy path decomposition of the parse tree $T$ as follows. For
each node $v$ define $T(v)$ to be the subtree rooted at $v$ and let $\mathrm{size}(v)$ be the number of descendant leaves of
$v$. We classify each node in $T$ as either heavy or light. The root is light. For each internal node $v$ we pick
a child of maximum size and classify it as heavy. The heavy child of $v$ is denoted $\mathrm{heavy}(v)$. The remaining
children are light. An edge to a light child is a light edge and an edge to a heavy child is a heavy edge.
Removing the light edges we partition $T$ into heavy paths. A heavy path suffix is a simple path $v_1, \ldots, v_k$
from a node $v_1$ to a leaf in $T(v_1)$, such that $v_{i+1} = \mathrm{heavy}(v_i)$, for $i = 1, \ldots, k - 1$.

If $u$ is a light child of $v$ then $\mathrm{size}(u) \le \mathrm{size}(v)/2$ since otherwise $u$ would be heavy. Consequently, the
number of light edges on a path from the root to a leaf is at most $O(\log N)$ [26]. Note that our definition of
heavy paths is slightly different than the usual one. We construct our heavy paths according to the number
of leaves of the subtrees and not the total number of nodes.

We extend heavy path decomposition of trees to SLPs in a straightforward manner. We consider each
grammar variable $v$ as a node in the directed acyclic graph defined by the grammar (see Fig. 1(c)). For a
node $v$ in $\mathcal{S}$ let $S(v)$ be the substring induced by the parse tree rooted at $v$ and define the size of $v$ to be
the length of $S(v)$. We define the heavy paths in $\mathcal{S}$ as in $T$ from the size of each node. Since the size of a
node $v$ in $\mathcal{S}$ is the number of leaves in $T(v)$ the heavy paths are well-defined and we may reuse all of the
terminology for trees on SLPs. In a single $O(n)$ time bottom-up traversal of $\mathcal{S}$ we can compute the sizes of
all nodes and hence the heavy path decomposition of $\mathcal{S}$.
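The bottom-up computation of sizes and heavy children can be sketched as follows. The dictionary encoding of the SLP, the example grammar for abaababa, the helper names, and the rule of breaking ties in favour of the left child are all assumptions of this sketch rather than choices fixed by the text.

```python
def heavy_decomposition(slp, order):
    """Return (size, heavy) for an SLP given as {var: char | (left, right)}.
    `order` lists the variables bottom-up (children before parents)."""
    size, heavy = {}, {}
    for v in order:
        rule = slp[v]
        if isinstance(rule, str):          # terminal rule: a leaf of the DAG
            size[v], heavy[v] = 1, None
        else:
            left, right = rule
            size[v] = size[left] + size[right]
            # The heavy child is a child of maximum size (ties: left, by assumption).
            heavy[v] = left if size[left] >= size[right] else right
    return size, heavy


def heavy_path_suffix(heavy, v):
    """Follow heavy children from v down to a leaf."""
    path = [v]
    while heavy[path[-1]] is not None:
        path.append(heavy[path[-1]])
    return path


slp = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
       "X5": ("X4", "X3"), "X6": ("X5", "X4")}
order = ["X1", "X2", "X3", "X4", "X5", "X6"]
size, heavy = heavy_decomposition(slp, order)
suffix = heavy_path_suffix(heavy, "X6")
```

Because each variable is processed once and only looks at its two children, the whole pass runs in time linear in the size of the grammar.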

2.2 Fast Random Access 

We first give an $O(\log N \log\log N)$ time and $O(n^2)$ preprocessing time and space solution. Our data structure
represents the following information for each heavy path suffix $v_1, \ldots, v_k$ in $\mathcal{S}$.

• The length $\mathrm{size}(v_1)$ of the string $S(v_1)$.

• The index $z$ of $v_k$ in the left-to-right order of the leaves in $T(v_1)$ and the character $S(v_1)[z]$.

• A predecessor data structure for the left size sequence $l_0, l_1, \ldots, l_k$, where $l_i$ is the sum of 1 plus the
sizes of the left and light children of the first $i$ nodes in the heavy path suffix.

• A predecessor data structure for the right size sequence $r_0, \ldots, r_k$, where $r_i$ is the sum of 1 plus the
sizes of the right and light children of the first $i$ nodes in the heavy path suffix.

With the above information we can perform a top down search of $T$ as follows. Suppose that we have
reached node $v_1$ with heavy path suffix $v_1, \ldots, v_k$ and our goal is to access the character $S(v_1)[p]$. We then
compare $p$ with the index $z$ of $v_k$. There are three cases:

1. If $p = z$ we report the stored character $S(v_1)[z]$ and end the search.

2. If $p < z$ we compute the predecessor $l_i$ of $p$ in the left size sequence. We continue the top down search
from the left child $u$ of $v_{i+1}$. The position of $p$ in $T(u)$ is $p - l_i + 1$.



3. If $p > z$ we compute the predecessor $r_i$ of $\mathrm{size}(v_1) - p$ in the right size sequence. We continue the top
down search from the right child $u$ of $v_{i+1}$. The position of $p$ in $T(u)$ is $p - (z + \sum_{j=i+2}^{k} \mathrm{size}(v_j))$ (note
that we can compute the sum in constant time as $r_k - r_{i+2}$).

Figure 2: The left and right size sequences for a heavy path suffix $v_1, v_2, v_3, v_4$. The dotted edges are to light
subtrees and the numbers in the bottom are subtree sizes. A search for $p = 5$ returns the stored character
for $S(v_1)[z]$. A search for $p = 4$ computes the predecessor $l_2$ of 4 in the left size sequence. The search
continues in the left subtree of $v_3$ for position $p - l_2 + 1 = 4 - 4 + 1 = 1$. A search for $p = 6$ computes the
predecessor $r_1$ of $7 - 6 = 1$ in the right size sequence. The search continues in the right subtree of $v_2$ for
position $p - z = 6 - 5 = 1$. (Values legible in the figure: $\mathrm{size}(v_1) = 7$, $z = 5$, $r_0 = 1$, $r_1 = 1$, $r_2 = 3$, $r_3 = 3$;
subtree sizes 3, 1, 1, 2.)

An example of the representation and the search algorithm is shown in Fig. 2. The complexity of the
algorithm depends on the implementation of the predecessor data structures. With a van Emde Boas data
structure [50] we can answer predecessor queries in $O(\log\log N)$ time. Our algorithm visits at most $O(\log N)$
heavy paths during a top down traversal of the parse tree and therefore we use $O(\log N \log\log N)$ time to
do a random access. However, the total length of all heavy path suffixes is $\Omega(n^2)$ (in the worst-case). Hence,
even with a linear space implementation of the van Emde Boas data structures [52] we use $\Omega(n^2)$ space in
total.
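One step of the top-down search can be illustrated on the heavy path suffix of Fig. 2, with a plain binary search standing in for the predecessor data structure. This is a sketch under stated assumptions: the sequences are transcribed from the figure as far as they are legible (the last right-sequence entry is inferred), the function name `search_step` is hypothetical, and the right-hand case is written in terms of the 1-based position counted from the right end, `size - p + 1`, which is an indexing convention chosen for this sketch so that it mirrors the left-hand case.

```python
from bisect import bisect_right


def search_step(p, size, z, left_seq, right_seq):
    """One step of the search on a heavy path suffix v_1..v_k.
    Returns ("leaf", k, None) if p hits the leaf v_k, otherwise
    (side, j, q): continue in the `side` light child of v_j at position q."""
    if p == z:
        return ("leaf", len(left_seq) - 1, None)
    if p < z:
        i = bisect_right(left_seq, p) - 1          # predecessor l_i of p
        return ("left", i + 1, p - left_seq[i] + 1)
    from_right = size - p + 1                      # position counted from the right end
    i = bisect_right(right_seq, from_right) - 1    # predecessor r_i
    return ("right", i + 1, right_seq[i + 1] - from_right)


# Values read off Fig. 2 (assumed where the figure is not legible).
left_seq = [1, 4, 4, 5, 5]
right_seq = [1, 1, 3, 3, 3]
steps = [search_step(p, 7, 5, left_seq, right_seq) for p in range(1, 8)]
```

For $p = 4$ and $p = 6$ the sketch continues in the left subtree of $v_3$ and the right subtree of $v_2$ at position 1, matching the caption of Fig. 2. Replacing `bisect_right` by a van Emde Boas structure changes only the cost of each step, not its outcome.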

2.3 A Linear Space Representation of Heavy Path Suffixes 

We show how to compactly represent all of the predecessor data structures from the algorithm of the previous
section in $O(n)$ space.

We introduce the heavy path suffix forest $H$ of $\mathcal{S}$. The nodes of $H$ are the nodes of $\mathcal{S}$ and a node $u$ is the
parent of $v$ in $H$ iff $u$ is the heavy child of $v$ in $\mathcal{S}$. Thus, a heavy path suffix $v_1, \ldots, v_k$ in $\mathcal{S}$ is a sequence of
ancestors from $v_1$ in $H$. We label the edge from $v$ to its parent $u$ by a left weight and right weight defined
as follows. If $u$ is the left child of $v$ in $\mathcal{S}$ the left weight is 0 and the right weight is $\mathrm{size}(v')$ where $v'$ is the
right child of $v$. Otherwise, the right weight is 0 and the left weight is $\mathrm{size}(v')$ where $v'$ is the left child of $v$.
Heavy path suffixes in $\mathcal{S}$ consist of unique nodes and therefore $H$ is a forest. A heavy path suffix in $\mathcal{S}$ ends
at one of $|\Sigma|$ leaves in $\mathcal{S}$ and therefore $H$ consists of $|\Sigma|$ trees each rooted at a unique character of $\Sigma$. The
total size of $H$ is $O(n)$ and we may easily compute it from the heavy path decomposition of $\mathcal{S}$ in $O(n)$ time.

A predecessor query on a left size sequence and right size sequence of a heavy path suffix $v_1, \ldots, v_k$ is
now equivalent to a weighted ancestor query on the left weights and right weights of $H$, respectively.
Farach-Colton and Muthukrishnan [21] showed how to support weighted ancestor queries in $O(\log\log N)$ time after
$O(n)$ space and preprocessing time. Hence, if we plug this in to our algorithm we obtain $O(\log N \log\log N)$
query time with $O(n)$ preprocessing time and space. In summary, we have the following result.

Theorem 4 For an SLP $\mathcal{S}$ of size $n$ representing a string of length $N$ we can support random access in
time $O(\log N \log\log N)$ after $O(n)$ preprocessing time and space.
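The forest $H$ and its edge weights can be built in one bottom-up pass, and the size sequences of any heavy path suffix are then prefix sums of the weights along the ancestors in $H$. The sketch below is illustrative only: the SLP encoding, the extra variable `X7`, the function names, and the left-biased tie-breaking are assumptions, and the sequences are read off by a naive walk rather than by a weighted ancestor structure.

```python
def heavy_suffix_forest(slp, order):
    """Build H: parent[v] is the heavy child of v in the SLP, and the edge
    (v, parent[v]) carries a left weight lw[v] and a right weight rw[v]."""
    size, parent, lw, rw = {}, {}, {}, {}
    for v in order:
        rule = slp[v]
        if isinstance(rule, str):
            size[v], parent[v] = 1, None           # terminals are the roots of H
            continue
        left, right = rule
        size[v] = size[left] + size[right]
        if size[left] >= size[right]:              # heavy child is the left child
            parent[v], lw[v], rw[v] = left, 0, size[right]
        else:                                      # heavy child is the right child
            parent[v], lw[v], rw[v] = right, size[left], 0
    return size, parent, lw, rw


def size_sequence(v, parent, weight):
    """Prefix sums (starting from 1) of the weights on the path from v to its root in H."""
    seq = [1]
    while parent[v] is not None:
        seq.append(seq[-1] + weight[v])
        v = parent[v]
    return seq


slp = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
       "X5": ("X4", "X3"), "X6": ("X5", "X4"), "X7": ("X4", "X6")}
order = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
size, parent, lw, rw = heavy_suffix_forest(slp, order)
left_seq = size_sequence("X7", parent, lw)
right_seq = size_sequence("X7", parent, rw)
```

Every variable stores only its own two weights, so the forest takes linear space even though the sequences of different heavy path suffixes overlap heavily.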
3 Interval-Biased Search Trees

We have so far seen how to obtain $O(\log N \log\log N)$ random access time on an SLP $\mathcal{S}$. Our goal now is to
reduce this to $O(\log N)$. Recall that $O(\log N \log\log N)$ was a result of performing $O(\log N)$ predecessor($p$)
queries, each in $O(\log\log N)$ time. In this section, we introduce a new predecessor data structure - the
interval-biased search tree. Each predecessor($p$) query on this data structure requires $O(\log(N/x))$ time, where
$x = \mathrm{successor}(p) - \mathrm{predecessor}(p)$.

To see the advantage of $O(\log(N/x))$ predecessor queries over $O(\log\log N)$, suppose that after performing
the predecessor query on the first heavy path of $T$ we discover that the next heavy path to search is the
heavy path suffix originating in node $u$. This means that the first predecessor query takes $O(\log(N/|S(u)|))$
time. Furthermore, the elements in $u$'s left size sequence (or right size sequence) are all from a universe
$\{0, 1, \ldots, |S(u)|\}$. Therefore, the second predecessor query takes $O(\log(|S(u)|/x))$ where $x = |S(u')|$ for some
node $u'$ in $T(u)$. The first two predecessor queries thus require time $O(\log(N/|S(u)|) + \log(|S(u)|/x)) = O(\log(N/x))$.
The time required for all $O(\log N)$ predecessor queries telescopes similarly for a total of $O(\log N)$.
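The telescoping argument can be written out as a short worked equation. The notation is introduced here for illustration only: $s_0 = N$ and $s_j \ge 1$ denotes the length of the string at which the $(j+1)$-th heavy path starts, $t = O(\log N)$ is the number of heavy paths visited, and each query is charged one unit of constant overhead.

```latex
\sum_{j=1}^{t} \Bigl( 1 + \log \frac{s_{j-1}}{s_j} \Bigr)
  \;=\; t + \log \frac{s_0}{s_t}
  \;\le\; t + \log N
  \;=\; O(\log N).
```

The logarithmic terms cancel pairwise, so only the first numerator and the last denominator survive, and the additive overhead is bounded by the number of light edges on a root-to-leaf path.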

We next show how to construct an interval-biased search tree in linear time and space. Simply using this
tree as the predecessor data structure for each heavy path suffix of $\mathcal{S}$ already results in the following lemma.

Lemma 1 For an SLP $\mathcal{S}$ of size $n$ representing a string of length $N$ we can support random access in time
$O(\log N)$ after $O(n^2)$ preprocessing time and space.

3.1 A Description of the Tree 

We now define the interval-biased search tree associated with $n$ integers $l_1 \le \ldots \le l_n$ from a universe
$\{0, 1, \ldots, N\}$. For simplicity, we add the elements $l_0 = 0$ and $l_{n+1} = N$. The interval-biased search tree
is a binary tree that stores the intervals $[l_0, l_1], [l_1, l_2], \ldots, [l_n, l_{n+1}]$ with a single interval in each node. We
describe the tree recursively as follows:

1. Let $i$ be such that $l_{i+1} - l_i > (l_{n+1} - l_0)/2$ (there is at most one such $i$).

2. If no such $i$ exists then let $i$ be such that $l_i - l_0 \le (l_{n+1} - l_0)/2$ and $l_{n+1} - l_{i+1} \le (l_{n+1} - l_0)/2$.

3. The root of the tree stores the interval $[l_i, l_{i+1}]$.

4. The left child of the root is the interval-biased search tree storing the intervals $[l_0, l_1], [l_1, l_2], \ldots, [l_{i-1}, l_i]$.

5. The right child is the interval-biased search tree storing the intervals $[l_{i+1}, l_{i+2}], [l_{i+2}, l_{i+3}], \ldots, [l_n, l_{n+1}]$.

When we search the tree for a query $p$ and reach a node corresponding to the interval $[l_i, l_{i+1}]$, we
compare $p$ with $l_i$ and $l_{i+1}$. If $l_i \le p \le l_{i+1}$ then we return $l_i$ as the predecessor. If $p < l_i$ (resp. $p > l_{i+1}$) we
continue the search in the left child (resp. right child). Notice that an interval $[l_i, l_{i+1}]$ of length $x = l_{i+1} - l_i$
such that $N/2^{j+1} \le x \le N/2^j$ is stored in a node of depth at most $j$. Therefore, a query $p$ whose
predecessor is $l_i$ (and whose successor is $l_{i+1}$) terminates at a node of depth at most $j$. The query time is
thus $j \le 1 + \log(N/x) = O(\log(N/x))$ which is exactly what we desire as $x = \mathrm{successor}(p) - \mathrm{predecessor}(p)$.
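The recursive definition and the search procedure can be sketched as follows. This is a simplified illustration: the class and function names are hypothetical, and the construction scans for a long interval at every node, so it does not achieve the linear construction time discussed in the text. Rule 2 is realised by taking the largest $i$ with $l_i$ at most the midpoint of the current range, which satisfies both inequalities.

```python
from bisect import bisect_right


class Node:
    def __init__(self, lo, hi, left=None, right=None):
        self.lo, self.hi, self.left, self.right = lo, hi, left, right


def build(l, a, b):
    """Tree over the intervals [l[a], l[a+1]], ..., [l[b-1], l[b]], or None if a >= b."""
    if a >= b:
        return None
    half = (l[b] - l[a]) / 2
    # Rule 1: an interval longer than half of the current range, if one exists.
    i = next((t for t in range(a, b) if l[t + 1] - l[t] > half), None)
    if i is None:
        # Rule 2: the largest i with l[i] <= midpoint of the current range.
        i = bisect_right(l, l[a] + half, a, b) - 1
    return Node(l[i], l[i + 1], build(l, a, i), build(l, i + 1, b))


def predecessor(root, p):
    """Return (l_i, depth) for the first visited interval with l_i <= p <= l_{i+1}.
    If p equals a stored endpoint, either adjacent interval may answer."""
    node, depth = root, 0
    while node is not None:
        if p < node.lo:
            node = node.left
        elif p > node.hi:
            node = node.right
        else:
            return node.lo, depth
        depth += 1
    raise ValueError("p lies outside [l_0, l_{n+1}]")


l = [0, 10, 20, 30, 1000]          # l_0 = 0 and l_{n+1} = N = 1000 added
root = build(l, 0, len(l) - 1)
```

In this example the long interval $[30, 1000]$ sits at the root and is found at depth 0, while the short intervals of length 10 sit deeper, in line with the $O(\log(N/x))$ bound.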

3.2 A Linear-Time Construction 

We still need to describe an efficient $O(n)$ time and space top-down construction of the interval-biased
search tree. Before we construct the tree, we initialize a Range Maximum Query (RMQ) data structure over
the array of interval lengths $(l_1 - l_0), (l_2 - l_1), \ldots, (l_{n+1} - l_n)$. Such a data structure can be constructed
in $O(n)$ time and answers RMQ($j, k$) queries in constant time [10, 11, 22, 26, 44]. Here, RMQ($j, k$) returns
$\arg\max_{i \in \{j, \ldots, k\}} (l_{i+1} - l_i)$.

We now show how to construct the interval-biased search tree storing the intervals $[l_j, l_{j+1}], \ldots, [l_k, l_{k+1}]$.
We focus on finding the interval $[l_i, l_{i+1}]$ to be stored in its root. The rest of the tree is constructed recursively
so that the left child is a tree storing the intervals $[l_j, l_{j+1}], \ldots, [l_{i-1}, l_i]$ and the right child is a tree storing
the intervals $[l_{i+1}, l_{i+2}], \ldots, [l_k, l_{k+1}]$. To identify the interval $[l_i, l_{i+1}]$, we first query RMQ($j, k$). If the query
returns an interval of length greater than $(l_{k+1} - l_j)/2$ then this interval is chosen as $[l_i, l_{i+1}]$.

Otherwise, we are looking for an interval $[l_i, l_{i+1}]$ such that $i$ is the largest value where $l_i \le (l_{k+1} + l_j)/2$
holds. We can find this interval in $O(\log(k - j))$ time by doing a binary search for $(l_{k+1} + l_j)/2$ in the
subarray $l_j, l_{j+1}, \ldots, l_{k+1}$. However, notice that we are not guaranteed that $[l_i, l_{i+1}]$ partitions the intervals
in the middle. In other words, $i - j$ can be much larger than $k - i$ and vice versa. This means that the total
time complexity of all the binary searches we do while constructing the entire tree can amount to $O(n \log n)$
and we want $O(n)$. To overcome this, notice that we can find $[l_i, l_{i+1}]$ in $\min\{\log(i - j), \log(k - i)\}$ time if we
use a doubling search from both sides of the subarray. That is, if prior to the binary search, we narrow the
search space by doing a parallel scan of the elements $l_j, l_{j+2}, l_{j+4}, l_{j+8}, \ldots$ and $l_k, l_{k-2}, l_{k-4}, l_{k-8}, \ldots$. This
turns out to be crucial for achieving $O(n)$ total construction time as we now show.

To verify the total construction time, first notice that the range maximum queries require only constant
time. We therefore need to compute the total time required for all the binary searches. Let $T(n)$ denote
the time complexity of all the binary searches, then $T(n) = T(i) + T(n - i) + \min\{\log i, \log(n - i)\}$ for some
$i$. Setting $d = \min\{i, n - i\} \le n/2$ we get that $T(n) = T(d) + T(n - d) + \log d$ for some $d \le n/2$, which is
equal³ to $O(n)$.
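The doubling search from both sides can be sketched as follows. This is one possible realisation, written for illustration: the function name `doubling_search` is hypothetical, the probes are placed at offsets 1, 2, 4, ... from each end, and Python's `bisect_right` performs the final binary search on the narrowed range. The sketch assumes the array is sorted and that $l_j$ is at most the target, so that an answer exists.

```python
from bisect import bisect_right


def doubling_search(l, j, k, target):
    """Largest i in [j, k] with l[i] <= target, assuming l is sorted and
    l[j] <= target. Probes gallop inward from both ends, so the number of
    comparisons is O(1 + log min(i - j, k - i))."""
    if l[k] <= target:
        return k
    lo, hi = j, k                      # invariant: l[lo] <= target < l[hi]
    step = 1
    while j + step < k:
        a, b = j + step, k - step
        if l[a] > target:              # answer lies between the last two left probes
            lo, hi = j + step // 2, a
            break
        if l[b] <= target:             # answer lies between the last two right probes
            lo, hi = b, k - step // 2
            break
        step *= 2
    # Plain binary search restricted to the narrowed range [lo, hi).
    return bisect_right(l, target, lo, hi) - 1


l = [0, 3, 3, 7, 10, 15, 21, 40]
found = [doubling_search(l, 0, len(l) - 1, t) for t in (0, 3, 8, 30, 40)]
```

If neither probe succeeds before the two scans cross, the answer is far from both ends, so a binary search over the whole range still costs only a constant more than $\min\{\log(i - j), \log(k - i)\}$.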

3.3 Final Tuning 

Our data structure so far requires $O(\log(N/x))$ time to answer a predecessor($p$) query where $x = \mathrm{successor}(p) - \mathrm{predecessor}(p)$.
We have seen that using this data structure for each heavy path suffix of $\mathcal{S}$ can reduce
the random access time to $O(\log N)$. However, this is currently at the cost of $O(n^2)$ preprocessing time and
space. In the next section, we reduce the preprocessing to linear. This will be achieved by utilizing the
overlaps between the domains of these $n$ data structures.

For that, we actually need one last important property⁴ from the interval-biased search tree. Suppose
that right before doing a predecessor($p$) query we know that $p > l_k$ for some $k$. We can make sure that
the query time is reduced to $O(\log((N - l_k)/x))$. To achieve this, in a single traversal of the tree, we compute for
each node its lowest common ancestor with the node $[l_n, l_{n+1}]$. Then, when searching for $p$, we can start the
search in the lowest common ancestor of $[l_k, l_{k+1}]$ and $[l_n, l_{n+1}]$ in the interval-biased search tree.

4 Closing the Time-Space Tradeoffs for Random Access 

We have so far seen two ideas that together offer a time-space tradeoff for random accessing an SLP. The first
idea was to use an interval-biased search tree as a predecessor data structure. Storing such a data structure
for each of the $n$ heavy path suffixes yields $O(\log N)$ random access time after $O(n^2)$ preprocessing. The
second idea was to utilize the overlaps between the domains of these $n$ data structures. These overlaps are
captured by the heavy path suffix forest $H$ whose size is only $O(n)$. Using a weighted ancestor data structure
on $H$ the preprocessing is reduced to $O(n)$ at the cost of $O(\log N \log\log N)$ random access time.

In this section we aim at closing this gap between $O(n)$ preprocessing and $O(\log N)$ random access time.
This will be achieved by designing a novel weighted ancestor data structure on $H$ whose building blocks
include interval-biased search trees. To do so, we actually perform a heavy path decomposition of $H$ itself.
For each heavy path $P$ in this decomposition we keep one interval-biased search tree for its left size sequence
and another one for its right size sequence. It is easy to see that the total size of all these interval-biased
search trees is bounded by $O(n)$. We focus on queries of the left size sequence; the right size sequence is
handled similarly.

Let P be a heavy path in the decomposition, let v be a vertex on this path, and let w(v,v') be the 
weight of the edge between v and its child v'. We denote by b(v) the weight of the part of P below v and 
by t(v) the weight above v. As an example, consider the green heavy path P = (v5-v4-v8-v9) in Fig. 3. Then 
b(v4) = w(v4,v8) + w(v8,v9) and t(v4) = w(v5,v4). In general, if P = (v_k-v_{k-1}-...-v_1) then v_1 is a leaf in H 
and b(v_{i+1}) is the i'th element in P's predecessor data structure. The b(·) and t(·) values of all vertices can 
easily be computed in O(n) time. 

3 By an inductive assumption that T(n) < 2n - log n - 2 we get that T(n) is at most 2d - log d - 2 + 2(n - d) - log(n - d) - 
2 + log d = 2n - log(n - d) - 4, which is at most 2n - log n - 3 since d < n/2. 

4 In fact, there exist linear-time-constructable predecessor data structures with query complexity only O(log log N) [42]. 
They are more complicated than our tree, but more importantly, their query time cannot handle N reducing to N - l^. 
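As an illustration, the following minimal sketch computes the t(·) and b(·) values of a single heavy path from its edge weights. The function name `path_offsets` and the top-down list layout are assumptions made for this example only; they are not part of the construction itself.

```python
from itertools import accumulate

def path_offsets(edge_weights):
    """edge_weights[i] is the weight of the edge between the i-th and the
    (i+1)-th vertex of one heavy path, listed from the top vertex downwards.
    Returns two lists indexed like the vertices of the path:
    t[i] = weight of the path above vertex i, b[i] = weight below vertex i."""
    t = [0] + list(accumulate(edge_weights))   # prefix sums from the top
    total = t[-1]                              # weight of the whole path
    b = [total - above for above in t]         # what is not above is below
    return t, b

# Hypothetical weights for a path v5-v4-v8-v9:
# w(v5,v4) = 3, w(v4,v8) = 2, w(v8,v9) = 4.
t, b = path_offsets([3, 2, 4])
```

With these hypothetical weights the second vertex (v4) gets t(v4) = 3 = w(v5,v4) and b(v4) = 6 = w(v4,v8) + w(v8,v9). Running this once per heavy path touches every edge once, which is where the linear total time comes from.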




Figure 3: The parse tree T of an SLP is depicted on the left, the heavy path suffix forest H is in the middle, 
and the light representation L of H is on the right. The heavy path decomposition of H is marked (in green) 
and defines the vertex set of L. 

Recall that given any vertex u in H and any 0 < p < N we need to be able to find the lowest ancestor v 
of u whose weighted distance from u is at least p. If we want the total random access time to be O(log N) 
then finding v should be done in O(log(|S(u)|/w(v,v'))) time, where v' is the child of v which is also an ancestor of 
u. Suppose that both u and v are on the same heavy path P in the decomposition. In this case, a single 
predecessor(p') query on P would indeed find v in O(log(t(u)/w(v,v'))) = O(log(|S(u)|/w(v,v'))) time, where p' = p + b(u). 
This follows from the property we described in Section 3.3. 
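The same-path case can be sketched as follows. This is only an illustration under stated assumptions: a plain binary search from Python's `bisect` module stands in for the interval-biased search tree, so the sketch runs in time logarithmic in the path length rather than in the biased bound, and the name `ancestor_on_path` and the leaf-first list layout are hypothetical.

```python
from bisect import bisect_left

def ancestor_on_path(b, i, p):
    """b[j] is the weight of the heavy path below its j-th vertex, with the
    vertices listed leaf first (so b[0] == 0 and b is non-decreasing).
    Returns the index of the lowest vertex at or above position i whose
    weighted distance from vertex i is at least p, or None if the answer
    lies above the top of this path."""
    target = p + b[i]                    # p' = p + b(u)
    j = bisect_left(b, target, lo=i)     # first j >= i with b[j] >= p'
    return j if j < len(b) else None

# Path listed leaf first with hypothetical b-values 0, 4, 6, 9.
b = [0, 4, 6, 9]
```

Here `ancestor_on_path(b, 0, 5)` returns index 2, since the vertex at index 1 is only at distance 4 from the leaf, and `ancestor_on_path(b, 1, 6)` returns `None` because the whole path above index 1 has weight 5.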

The problem is thus to locate v when, in the decomposition of H, v is on the heavy path P but u is not. 
To do so, we first locate a vertex w that is both an ancestor of u and belongs to P. Once w is found, if its 
weighted distance from u is greater than p then v = w. Otherwise, a single predecessor(p'') query on P finds 
v in O(log(t(w)/w(v,v'))) time, which is O(log(|S(u)|/w(v,v'))) since t(w) < |S(u)|. Here, p'' = p - weight(path from u to w 
in H) + b(w). We are therefore only left with the problem of finding w and the weight of the path from u 
to w. 

4.1 A Light Representation of Heavy Paths 

In order to navigate from u up to w we introduce the light representation L of H. Intuitively, L is a 
(non-binary) tree that captures the light edges in the heavy-path decomposition of H. Every path P in the 
decomposition of H corresponds to a single vertex P in L, and every light edge in the decomposition of H 
corresponds to an edge in L. If a light edge e in H connects a vertex w with its child then the weight of the 
corresponding edge in L is the original weight of e plus t(w). (See the edge of weight w(v4,v3) + w(v5,v4) 
in Fig. 3.) 
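A minimal sketch of how L could be derived from H is given below. Several choices here are assumptions made for the example and are not fixed by the text: H is taken to be a single tree rooted at vertex 0 and given as a parent array with parent[v] < v, the heavy child is chosen by number of nodes in the subtree, and each heavy path is named by its topmost vertex. The function name `light_representation` is hypothetical.

```python
def light_representation(parent, weight):
    """parent[v] is the parent of v in H (None for the root, vertex 0) and
    weight[v] is the weight of the edge (parent[v], v).
    Returns (head, t, l_edges): head[v] is the top vertex of v's heavy path,
    t[v] is the weight of that path above v, and l_edges maps the head of
    each non-root path to (head of parent path, edge weight in L)."""
    n = len(parent)
    size = [1] * n
    for v in range(n - 1, 0, -1):            # children before parents
        size[parent[v]] += size[v]
    heavy = [None] * n                       # heavy child of every vertex
    for v in range(1, n):
        u = parent[v]
        if heavy[u] is None or size[v] > size[heavy[u]]:
            heavy[u] = v
    head = [0] * n
    t = [0] * n
    for v in range(1, n):                    # parents before children
        u = parent[v]
        if heavy[u] == v:                    # heavy edge: stay on u's path
            head[v], t[v] = head[u], t[u] + weight[v]
        else:                                # light edge: v starts a new path
            head[v], t[v] = v, 0
    l_edges = {}
    for v in range(1, n):
        u = parent[v]
        if heavy[u] != v:                    # one edge of L per light edge
            l_edges[v] = (head[u], weight[v] + t[u])
    return head, t, l_edges

# Vertex 0 plays the role of v5, vertex 1 of v4, vertex 3 of v3.
head, t, l_edges = light_representation([None, 0, 1, 1, 2], [0, 3, 2, 5, 4])
```

In this small instance the only light edge is (1, 3) with weight 5, and t(1) = 3, so `l_edges` is `{3: (0, 8)}`: the original weight of the light edge plus the weight of the heavy path above its upper endpoint.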

The problem of locating w in H now translates to a weighted ancestor query on L. Indeed, if u belongs 
to a heavy path P' then P' is also a vertex in L and locating w translates to finding the lowest ancestor of 
P' in L whose weighted distance from P' is at least p - t(u). However, a weighted ancestor data structure 
on L would be too costly. Instead, we utilize the important fact that the height of L is only O(log n). This 
follows because the edges of L correspond only to light edges of H. 

We can therefore construct, for every root-to-leaf path in L, an interval-biased search tree as its 
predecessor data structure. The total time and space for constructing these data structures is O(n log n). A query 
for finding the ancestor of P' in L whose weighted distance from P' is at least p - t(u) can then be done in 
O(log(|S(u)|/t(w))) time. This is O(log(|S(u)|/w(v,v'))) as w(v,v') < t(w). We summarize this with the following lemma. 

Lemma 2 For an SLP S of size n representing a string of length N we can support random access in time 
O(log N) after O(n log n) preprocessing time and space. 

4.2 An Inverse-Ackermann Type Bound 

We have just seen that after O(n log n) preprocessing we can support random access in O(log N) time. 
This superlinear preprocessing originates in the O(n log n)-sized data structure that we construct on L for 
O(log(|S(u)|/w(v,v')))-time weighted ancestor queries. We now turn to reducing the preprocessing to be arbitrarily 
close to linear by recursively shrinking the size of this weighted ancestor data structure on L. 

In order to do so, we perform a decomposition of L that was originally introduced by Alstrup, Husfeldt, 
and Rauhe [2] for solving the marked ancestor problem: Given the rooted tree L of n nodes, for every 
maximally high node whose subtree contains no more than log n leaves, we designate the subtree rooted at 
this node a bottom tree. Nodes not in a bottom tree make up the top tree. It is easy to show that the top 
tree has at most n/log n leaves and that this decomposition can be done in linear time. 
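The decomposition can be sketched in a few lines. This is an illustrative sketch only: the tree is assumed to be given as a parent array with parent[v] < v and root 0, the leaf threshold (log n in the text) is passed in explicitly, and the name `top_bottom_decomposition` is hypothetical.

```python
def top_bottom_decomposition(parent, threshold):
    """parent[v] is the parent of v (None for the root, vertex 0).
    Returns (top, bottom_roots): the nodes of the top tree and the roots of
    the bottom trees, where a bottom tree is rooted at a maximally high node
    whose subtree contains at most `threshold` leaves."""
    n = len(parent)
    has_child = [False] * n
    for v in range(1, n):
        has_child[parent[v]] = True
    leaves = [0] * n
    for v in range(n - 1, -1, -1):           # children before parents
        if not has_child[v]:
            leaves[v] = 1
        if v > 0:
            leaves[parent[v]] += leaves[v]
    top = [v for v in range(n) if leaves[v] > threshold]
    bottom_roots = [v for v in range(n)
                    if leaves[v] <= threshold
                    and (v == 0 or leaves[parent[v]] > threshold)]
    return top, bottom_roots

# Complete binary tree on 7 nodes (4 leaves), threshold 2.
top, bottom_roots = top_bottom_decomposition([None, 0, 0, 1, 1, 2, 2], 2)
```

In this example only the root has more than two leaves below it, so the top tree is the single node 0 and the bottom trees are rooted at nodes 1 and 2. Every leaf of the top tree has more than `threshold` leaves of the original tree below it, which is why the top tree has at most n/threshold leaves.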

Notice that we can afford to construct, for every root-to-leaf path in the top tree, an interval-biased search 
tree as its predecessor data structure. This is because there will be only n/log n such data structures and 
each is of size height(L) = O(log n). In this way, a weighted ancestor query that originates in a top tree 
node takes O(log(|S(u)|/w(v,v'))) time as required. The problem is therefore handling queries originating in bottom 
trees. 

To handle such queries, we would like to recursively apply our O(n log n) weighted ancestor data structure 
on each one of the bottom trees. This would work nicely if the number of nodes in a bottom tree was O(log n). 
Unfortunately, we only know this about the number of its leaves. We therefore use a branching representation 
B for each bottom tree. The number of nodes in the representation B is indeed O(log n) and it is defined as 
follows. 

We first partition L into disjoint paths according to the following rule: A node v belongs to the same 
path as its child unless v is a branching node (has more than one child). We associate each path P in this 
decomposition with a unique interval-biased search tree as its predecessor data structure. The branching 
representation B is defined as follows. Every path P corresponds to a single node in B. An edge e connecting 
path P' with its parent path P corresponds to an edge in B whose weight is e's original weight plus the total 
weighted length of the path P' (see Fig. 4). 
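The path-collapsing rule can be sketched as follows. The sketch assumes a tree given as a parent array with parent[v] < v and root 0, and it names every path by its bottom node (a leaf or a branching node); both conventions and the name `branching_representation` are assumptions of this example.

```python
def branching_representation(parent, weight):
    """parent[v] is the parent of v (None for the root, vertex 0) and
    weight[v] is the weight of the edge (parent[v], v).
    Returns (bottom, b_edges): bottom[v] is the lowest node of v's path and
    b_edges maps each non-root path (named by its bottom node) to
    (bottom node of the parent path, edge weight in B)."""
    n = len(parent)
    nchild = [0] * n
    for v in range(1, n):
        nchild[parent[v]] += 1
    bottom = list(range(n))
    below = [0] * n            # weighted length of v's path below v
    for v in range(n - 1, 0, -1):            # children before parents
        u = parent[v]
        if nchild[u] == 1:                   # u shares the path of its child
            bottom[u] = bottom[v]
            below[u] = below[v] + weight[v]
    b_edges = {}
    for v in range(1, n):
        u = parent[v]
        if nchild[u] > 1:                    # v is the top node of a path P'
            b_edges[bottom[v]] = (u, weight[v] + below[v])
    return bottom, b_edges

# Root 0 branches into a path 1-2-3-4 (edge weights 2, 7, 5, entered by an
# edge of weight 3) and a single leaf 5 (entered by an edge of weight 1).
bottom, b_edges = branching_representation([None, 0, 1, 2, 3, 0],
                                           [0, 3, 2, 7, 5, 1])
```

The path 1-2-3-4 collapses to one node of B, and its edge to the root gets weight 3 + (2 + 7 + 5) = 17, so `b_edges` is `{4: (0, 17), 5: (0, 1)}`.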

On the left is some bottom tree, a weighted tree with log n leaves. The bottom tree can be decomposed 
into log n paths (marked in red), each with at most one branching node. Replacing each such path with 
a single node we get the branching representation B as depicted on the right. The edge weight 17 is 
obtained by the original weight 3 plus the weighted path 2+7+5. 

Figure 4: A bottom tree and its branching representation B 

Each internal node in B has at least two children and therefore the number of nodes in B is O(log n). 
Furthermore, similarly to Section 4.1, our only remaining problem is weighted ancestor queries on B. Once 
the correct node is found in B, we can query the interval-biased search tree of its corresponding path in L 
in O(log(|S(u)|/w(v,v'))) time as required. 

Now that we can capture a bottom tree with its branching representation B of logarithmic size, we 
could simply use our O(n log n) weighted ancestor data structure on every B. This would require an 
O(log n log log n)-time construction for each one of the n/log n bottom trees for a total of O(n log log n) 
construction time. In addition, every bottom tree node v stores its weighted distance d(v) from the root of 
its bottom tree. After this preprocessing, upon query v, we first check d(v) to see whether the target node 
is in the bottom tree or the top tree. Then, a single predecessor query on the (bottom or top) tree takes 
O(log(|S(u)|/w(v,v'))) time as required. 

It follows that we can now support random access on an SLP in time O(log N) after only O(n log log n) 
preprocessing. In a similar manner we can use this O(n log log n) preprocessing recursively on every B to 
obtain an O(n log log log n) solution. Consequently, we can reduce the preprocessing to O(n log* n) while 
maintaining O(log N) random access. Notice that if we do this naively then the query time increases by a 
log* n factor due to the log* n d(v) values we have to check. To avoid this, we simply use an interval-biased 
search tree for every root-to-leaf path of log* n d(v) values. This only requires an additional O(n log* n) 
preprocessing and the entire query remains O(log(|S(u)|/w(v,v'))). 
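To see where the log* n term comes from, note that each level of the recursion replaces a structure on n nodes by structures on roughly log n nodes, so the number of levels before the size becomes constant is the iterated logarithm. The following sketch simply counts these levels; the function name `log_star` and the use of base-2 logarithms are assumptions of the example.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the value drops
    to at most 1, i.e. the depth of the recursion described above."""
    levels = 0
    while n > 1:
        n = math.log2(n)
        levels += 1
    return levels

# 65536 -> 16 -> 4 -> 2 -> 1 takes four applications of log2.
depth = log_star(65536)
```

Even for inputs far larger than any realistic grammar the depth stays in the single digits, which is why the preprocessing is described as arbitrarily close to linear.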

Finally, we note that choosing the recursive sizes more carefully (in the spirit of [1, 16]) can reduce the 
log* n factor down to α_k(n) for any fixed k. In summary, we have the following result. 

Theorem 5 For an SLP S of size n representing a string of length N we can support random access in 
time O(log N) after O(n · α_k(n)) preprocessing time and space for any fixed k. 

Combining Theorems 4 and 5 gives us Theorem 1. 



5 Substring Decompression 

We now extend our random access solutions from the previous section to efficiently support substring 
decompression. Note that we can always decompress a substring of length m using m random access computations. 
In this section we show how to do it using just 2 random access computations and additional O(m) time. 
This immediately implies Theorem 2. 

We then extend the representation of S as follows. For each node v in S we add two pointers: one to the next 
descendant node on the heavy path suffix for v whose light child is to the left of the heavy path suffix, and 
one to the next descendant node whose light child is to the right of the heavy path suffix. This increases the 
space of the data structure by only a constant factor. Furthermore, we may compute these pointers during the 
construction of the heavy path decomposition of S without increasing the asymptotic complexity. 

We decompress a substring S[i,j] of length m = j - i as follows. First, we compute the lowest common 
ancestor v of the search paths for i and j by doing a top-down search for i and j in parallel. We then continue 
the search for i and j independently. Along each heavy path on the search for i we collect all subtrees to 
the right of the heavy path in a linked list using the above pointers, since these are the subtrees whose 
leaves lie between i and j. The concatenation of these linked lists gives the roots of the subtrees to the 
right of the search path from v to i. Similarly, we compute the linked list of subtrees to the left of the 
search path from v to j. Finally, we decode the subtrees from the linked lists, thereby producing the string S[i,j]. 
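The idea of expanding only the subtrees that hang between the two search paths can be sketched with a plain recursive descent. This is a simplified illustration, not the data structure described above: it does not use heavy paths or the extra pointers, so it takes time proportional to m plus the height of the grammar rather than the random access bound. The dictionary encoding of the SLP and the names `slp_lengths` and `substring` are assumptions of the example.

```python
def slp_lengths(rules):
    """rules maps each nonterminal to a single character or to a pair
    (left, right) of nonterminals. Returns |S(x)| for every nonterminal x."""
    length = {}
    def rec(x):
        if x not in length:
            r = rules[x]
            length[x] = 1 if isinstance(r, str) else rec(r[0]) + rec(r[1])
        return length[x]
    for x in rules:
        rec(x)
    return length

def substring(rules, length, x, i, j, out):
    """Append the characters of S(x)[i:j] (0-based, half-open) to out,
    descending only into subtrees that overlap the requested range."""
    if i >= j:
        return
    r = rules[x]
    if isinstance(r, str):
        out.append(r)
        return
    left, right = r
    mid = length[left]
    substring(rules, length, left, i, min(j, mid), out)
    substring(rules, length, right, max(i, mid) - mid, j - mid, out)

# A small Fibonacci-like grammar; S(X6) = "abaababa".
rules = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
         "X5": ("X4", "X3"), "X6": ("X5", "X4")}
length = slp_lengths(rules)
out = []
substring(rules, length, "X6", 2, 6, out)
result = "".join(out)   # "aaba"
```

Subtrees that lie completely inside the range are expanded in full, while partially covered nodes occur only along the two boundary paths, mirroring the two search paths for i and j.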

With our added pointers we construct the linked lists in time proportional to the length of the lists, which 
is O(m). Decoding each subtree uses time proportional to the size of the subtree. The total size of the 
subtrees is O(m) and therefore decoding also takes O(m) time. Adding the time for the two random access 
computations for i and j we obtain Theorem 2. 

6 Compressed Approximate String Matching 



We now show how to efficiently solve the compressed approximate string matching problem for grammar 
compressed strings. Let P be a string of length m and let k be an error threshold. We assume that the 
algorithm for the uncompressed problem produces the matches in sorted order (as is the case for all solutions 
that we are aware of). Otherwise, additional time for sorting should be included in the bounds. 

To find all approximate occurrences of P within S without decompressing S we combine our substring 
decompression solution from the previous section with a technique for compressed approximate string matching 
on LZ78 and LZW compressed strings [12]. 

We find the occurrences of P in S in a single bottom-up traversal of S using an algorithm for 
(uncompressed) approximate string matching as a black box. At each node v in S we compute the matches 
of P in S(v). If v is a leaf we decompress the single character string S(v) in constant time and run our 
approximate string matching algorithm. Otherwise, suppose that v has left child v_l and right child v_r. We 
have that S(v) = S(v_l) · S(v_r). We decompress the substring S' of S(v) consisting of the min{|S(v_l)|, m + k} 
last characters of S(v_l) and the min{|S(v_r)|, m + k} first characters of S(v_r) and run our approximate string 
matching algorithm on P and S'. We compute the set of matches of P in S(v) by merging the lists of matches 
of P in S(v_l), S(v_r), and S' (we assume here that our approximate string matching algorithm 
produces the list of matches in sorted order). This suffices since any approximate match with at most k errors 
starting in S(v_l) and ending in S(v_r) must be contained within S'. 
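The bottom-up traversal can be sketched as follows, under several simplifying assumptions that differ from the text. Sellers' dynamic program is used as the black-box matcher and matches are reported by their end positions; instead of the substring decompression structure, each nonterminal caches its first and last m + k characters; and lists are combined with a set union rather than a linear merge. The names `sellers` and `slp_matches`, and the dictionary encoding of the SLP, are hypothetical.

```python
def sellers(pattern, text, k):
    """End positions e (1-based) such that some substring of text ending
    at e is within edit distance k of pattern."""
    m = len(pattern)
    col = list(range(m + 1))                 # column for the empty prefix
    ends = []
    for e, c in enumerate(text, 1):
        prev_diag, col[0] = col[0], 0        # a match may start anywhere
        for i in range(1, m + 1):
            cur = min(col[i] + 1,            # extra text character
                      col[i - 1] + 1,        # missing pattern character
                      prev_diag + (pattern[i - 1] != c))
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(e)
    return ends

def slp_matches(rules, order, pattern, k):
    """rules maps a nonterminal to a character or a pair (left, right);
    order lists the nonterminals with children before parents.
    Returns x -> (|S(x)|, prefix, suffix, sorted match end positions)."""
    w = len(pattern) + k                     # window half-width m + k
    info = {}
    for x in order:
        r = rules[x]
        if isinstance(r, str):
            info[x] = (1, r, r, sellers(pattern, r, k))
            continue
        (nl, pl, sl, el), (nr, pr, sr, er) = info[r[0]], info[r[1]]
        window = sl + pr                     # the string S' of the text
        offset = nl - len(sl)                # start of S' inside S(x)
        crossing = {offset + e for e in sellers(pattern, window, k)}
        ends = sorted(set(el) | {nl + e for e in er} | crossing)
        info[x] = (nl + nr, (pl + pr)[:w], (sl + sr)[-w:], ends)
    return info

# S(X6) = "abaababa"; search for "aba" with at most one error.
rules = {"X1": "b", "X2": "a", "X3": ("X2", "X1"), "X4": ("X3", "X2"),
         "X5": ("X4", "X3"), "X6": ("X5", "X4")}
order = ["X1", "X2", "X3", "X4", "X5", "X6"]
ends = slp_matches(rules, order, "aba", 1)["X6"][3]
```

Storing the match list at every nonterminal is done here only to keep the sketch short; it uses more space than the bounds in this section account for. The result for the start symbol agrees with running `sellers` directly on the decompressed string, because every match that crosses the boundary between the two children has length at most m + k and therefore lies inside the window.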

For each node v in S we decompress a substring of length O(m + k) = O(m), solve an approximate string 
matching problem between two strings of length O(m), and merge lists of matches. Since there are n nodes 
in S we do n substring decompressions and approximate string matching computations on strings of length 
O(m) in total. The merging is done on disjoint matches in S and therefore takes O(occ) time, where occ is 
the total number of matches of P in S. With our substring decompression result from Theorem 5 and an 
arbitrary approximate string matching algorithm we obtain Theorem 3. 